18 Assembling the Complete Model (GPT Style)
Integration of all the components developed earlier into a functional Decoder-Only architecture that is ready to generate text.
19 Assembling the Complete Model (GPT Style)
This chapter details the architectural integration of a Decoder-Only Transformer, the design paradigm behind models like GPT-2, GPT-3, and Llama. Unlike the original Transformer, which pairs an Encoder with a Decoder for sequence-to-sequence tasks such as translation, the GPT style is designed purely for autoregressive sequence modeling: predicting the next token from the preceding context.
19.1 1. High-Level Architecture
The GPT architecture consists of a learned embedding layer, a stack of identical “Decoder Blocks,” and a final projection head to map internal representations back to the vocabulary space.
19.1.1 Architectural Diagram
The following diagram illustrates the data flow from input tokens to output logits. Note the Pre-Layer Normalization arrangement, which is standard in modern GPT implementations (e.g., GPT-2/3) for improved training stability compared to the original Post-Norm formulation.
```mermaid
graph TD
    subgraph Inputs
        I[Input Token IDs] --> TE[Token Embeddings]
        P[Position IDs] --> PE[Positional Embeddings]
    end
    TE & PE --> Sum((+))
    Sum --> Drop1[Dropout]
    subgraph "Transformer Block (Repeated N times)"
        Drop1 --> LN1[Layer Norm 1]
        LN1 --> MSA[Masked Multi-Head Attention]
        MSA --> Res1((+))
        Drop1 --> Res1
        Res1 --> LN2[Layer Norm 2]
        LN2 --> MLP["Feed-Forward Network<br/>(GELU Activation)"]
        MLP --> Res2((+))
        Res1 --> Res2
    end
    Res2 --> LNF[Final Layer Norm]
    LNF --> Head["Linear Head<br/>(Project to Vocab Size)"]
    Head --> Logits[Logits]
    Logits --> Soft[Softmax]
    Soft --> Prob[Next Token Probabilities]
    style Inputs fill:#f9f,stroke:#333,stroke-width:2px
    style MSA fill:#bbf,stroke:#333,stroke-width:2px
    style MLP fill:#bbf,stroke:#333,stroke-width:2px
    style Head fill:#bfb,stroke:#333,stroke-width:2px
```
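The same data flow can be written as a short set of equations. The symbols below (\(E\) and \(P\) for the token and positional embedding tables, \(h_\ell\) for the hidden state after block \(\ell\), \(a_\ell\) for the state after the attention residual, \(W_{head}\) for the output projection) are notation introduced for this summary only, and \(x\) is a sequence of \(T\) token IDs.

\[
\begin{aligned}
h_0 &= \mathrm{Dropout}\big(E[x] + P[0{:}T]\big) \\
a_\ell &= h_{\ell-1} + \mathrm{MHA}\big(\mathrm{LN}_1(h_{\ell-1})\big) \\
h_\ell &= a_\ell + \mathrm{MLP}\big(\mathrm{LN}_2(a_\ell)\big), \qquad \ell = 1, \dots, N \\
\mathrm{logits} &= \mathrm{LN}_f(h_N)\, W_{head}^{\top} \\
p(x_{t+1} \mid x_{\le t}) &= \mathrm{softmax}(\mathrm{logits}_t)
\end{aligned}
\]

In this Pre-Norm form the residual path (\(h_{\ell-1} \to a_\ell \to h_\ell\)) is never normalized itself; each Layer Norm sits inside a branch that is added back to it.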
19.2 2. Component Integration
To assemble the model, we integrate the components defined in previous chapters. The architecture is defined by a handful of hyperparameters: \(d_{model}\) (embedding dimension), \(N_{layers}\) (depth), \(N_{heads}\) (attention heads), \(V\) (vocabulary size), and the context length (the maximum sequence length, which sets the size of the positional embedding table).
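To get a feel for how these hyperparameters determine model size, the sketch below estimates the parameter count. It assumes a GPT-2 style layout that the chapter does not spell out in full: biases on every linear layer, a \(4 \times d_{model}\) MLP expansion, two Layer Norms per block, and an output head tied to the token embeddings. The function name `gpt_param_count` is hypothetical. The number of attention heads does not appear because splitting \(d_{model}\) across heads does not change the number of weights.

```python
def gpt_param_count(vocab_size, context_len, d_model, n_layers, tied_head=True):
    """Rough parameter count for a GPT-2 style decoder (assumed layout)."""
    tok = vocab_size * d_model                       # token embedding table
    pos = context_len * d_model                      # learned positional table
    # Attention: fused Q/K/V projection plus output projection, with biases.
    attn = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4*d_model, then project back, with biases.
    mlp = (4 * d_model * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    norms = 2 * (2 * d_model)                        # two LayerNorms (scale + shift)
    per_block = attn + mlp + norms                   # = 12*d^2 + 13*d
    final_norm = 2 * d_model
    head = 0 if tied_head else vocab_size * d_model  # tied head adds no weights
    return tok + pos + n_layers * per_block + final_norm + head


# GPT-2 small sizes: V=50257, context=1024, d_model=768, 12 layers.
print(gpt_param_count(50257, 1024, 768, 12))  # 124439808
```

With these sizes the token embedding table alone accounts for roughly 38.6M of the 124M parameters, which is why sharing it with the output head matters for small models.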
19.2.1 A. The Embedding Layer
The entry point of the model. It converts discrete token indices into dense vectors.
* Token Embeddings: A lookup table of size \((V, d_{model})\).
* Positional Embeddings: A lookup table of size \((ContextLen, d_{model})\). In GPT, these are typically learned parameters rather than fixed sinusoidal functions.
* Combination: The two embeddings are summed element-wise.
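As a minimal sketch of the lookup-and-sum step, the snippet below uses small illustrative sizes (the values of `vocab_size`, `context_len`, and `d_model` are arbitrary choices, not taken from the chapter). Note that the positional embeddings have no batch dimension; broadcasting adds the same position vectors to every sequence in the batch.

```python
import torch
import torch.nn as nn

vocab_size, context_len, d_model = 100, 16, 8   # illustrative sizes only

tok = nn.Embedding(vocab_size, d_model)         # table of shape (V, d_model)
pos = nn.Embedding(context_len, d_model)        # table of shape (ContextLen, d_model)

idx = torch.tensor([[5, 7, 9], [1, 2, 3]])      # token IDs, shape (B=2, T=3)
positions = torch.arange(idx.size(1))           # [0, 1, 2], shape (T,)

# (B, T, d_model) + (T, d_model) broadcasts over the batch dimension.
x = tok(idx) + pos(positions)
print(x.shape)  # torch.Size([2, 3, 8])
```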
19.2.2 B. The Decoder Block (The “Heart”)
The bulk of the computation happens here. A single block contains:
1. Layer Normalization (Pre-Norm): Applied before each sub-layer. Because the normalization sits inside the residual branch, the residual path itself stays an identity mapping, so gradients flow directly through it, mitigating vanishing gradients in deep networks.
2. Masked Multi-Head Self-Attention (MHA): Allows the model to relate tokens to one another. The causal mask (\(-\infty\) in the strictly upper-triangular entries of the attention scores, applied before the softmax) ensures position \(t\) can only attend to positions \(0\) to \(t\), preserving the autoregressive property.
3. Residual Connection 1: Adds the input of the block to the output of the attention mechanism.
4. Feed-Forward Network (MLP): A two-layer perceptron that expands the dimension (usually to \(4 \times d_{model}\)) and projects it back, typically using GELU (Gaussian Error Linear Unit) activation.
5. Residual Connection 2: Adds the output of the first residual connection to the output of the MLP.
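The five steps above can be sketched as a single module. This is an illustrative version, not necessarily the `Block` built in earlier chapters: the layer names, the fused `qkv` projection, and the config fields `n_head` and `dropout` are assumptions made for this example.

```python
import math
from types import SimpleNamespace

import torch
import torch.nn as nn
from torch.nn import functional as F


class Block(nn.Module):
    """Pre-norm decoder block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.qkv = nn.Linear(config.n_embd, 3 * config.n_embd)   # fused Q, K, V
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),         # expand
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),         # project back
            nn.Dropout(config.dropout),
        )
        self.drop = nn.Dropout(config.dropout)
        # True above the diagonal marks the future positions to hide.
        mask = torch.triu(
            torch.ones(config.block_size, config.block_size, dtype=torch.bool),
            diagonal=1,
        )
        self.register_buffer("causal_mask", mask)

    def attend(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.causal_mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))

    def forward(self, x):
        x = x + self.attend(self.ln_1(x))   # steps 1-3: norm, attention, residual
        x = x + self.mlp(self.ln_2(x))      # steps 4-5: norm, MLP, residual
        return x


# Hypothetical config values, chosen only to keep the example small.
cfg = SimpleNamespace(n_embd=16, n_head=4, block_size=8, dropout=0.0)
block = Block(cfg).eval()
x = torch.randn(2, 8, 16)
y = block(x)
print(y.shape)  # torch.Size([2, 8, 16])
```

A quick way to check the causal property is to perturb the last token of `x` and confirm that the outputs at all earlier positions are unchanged.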
19.2.3 C. The Output Head
After passing through \(N\) blocks:
1. Final Layer Norm: Stabilizes the final hidden states.
2. Linear Projection: Maps each \(d_{model}\)-dimensional vector to \(V\) logits, one per vocabulary entry.
3. Weight Tying (Optional but Common): Often, the weights of the Output Head are shared with the input Token Embeddings. This removes a \(V \times d_{model}\) matrix from the parameter count and forces the input and output representations of each token to live in the same vector space.
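A small sketch of weight tying in PyTorch, using illustrative sizes: assigning one module's `weight` to the other makes both refer to the same `nn.Parameter`, so the matrix is stored and updated once. The `nn.ModuleDict` wrapper is only there to count parameters; it is not part of the chapter's model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 8   # illustrative sizes only

emb = nn.Embedding(vocab_size, d_model)             # weight: (V, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (V, d_model)
head.weight = emb.weight                            # tie: one shared Parameter

# parameters() yields each shared Parameter once, so the count is V * d_model.
model = nn.ModuleDict({"emb": emb, "head": head})
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 800

# The head now scores each hidden vector against every token embedding.
h = torch.randn(2, 3, d_model)
logits = head(h)               # (2, 3, V)
```

The shapes line up without a transpose because `nn.Linear` stores its weight as `(out_features, in_features)`, which is already `(V, d_model)`.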
19.3 3. Implementation Logic
Below is a structural representation of the complete model using a PyTorch-like class structure. This demonstrates how the components interact in code.
```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        # 1. Embeddings
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd)
        self.position_embedding = nn.Embedding(config.block_size, config.n_embd)
        self.drop = nn.Dropout(config.dropout)

        # 2. Stack of Decoder Blocks
        # Assuming 'Block' is a class defined in previous chapters
        self.blocks = nn.ModuleList([
            Block(config) for _ in range(config.n_layer)
        ])

        # 3. Final Normalization
        self.ln_f = nn.LayerNorm(config.n_embd)

        # 4. Output Head
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying: embedding weights == output head weights
        self.token_embedding.weight = self.head.weight

        self.apply(self._init_weights)

    def _init_weights(self, module):
        """ Initialize weights (typically normal distribution with small std) """
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        # idx shape: (Batch, Seq_Len)
        B, T = idx.shape

        # Create position indices: [0, 1, 2, ..., T-1]
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)

        # Forward pass through embeddings
        tok_emb = self.token_embedding(idx)     # (B, T, n_embd)
        pos_emb = self.position_embedding(pos)  # (T, n_embd)
        x = self.drop(tok_emb + pos_emb)

        # Forward pass through transformer blocks
        for block in self.blocks:
            x = block(x)

        # Final Norm
        x = self.ln_f(x)

        # Project to vocabulary
        logits = self.head(x)  # (B, T, vocab_size)

        loss = None
        if targets is not None:
            # Flatten for CrossEntropyLoss
            # logits: (B*T, vocab_size), targets: (B*T)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss
```
19.4 4. The Forward Pass vs. Inference Generation
It is crucial to distinguish how this model operates during training versus generation.
19.4.1 Training (Parallel)
During training, we have access to the entire ground-truth sequence.
1. Input: The sequence [A, B, C].
2. Target: The sequence [B, C, D].
3. Masking: The causal mask inside the Attention mechanism allows the model to process all tokens simultaneously. The model predicts B given A, C given AB, and D given ABC in a single forward pass.
4. Loss: Cross-Entropy loss is calculated on all positions simultaneously.
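The input/target shift and the flattened loss can be sketched as follows. The token IDs 0 to 3 stand in for A to D, and the all-zero `logits` tensor is a placeholder for a real model's output (an assumption made so the example runs on its own). Zero logits correspond to a uniform prediction, whose cross-entropy is \(\ln V\).

```python
import math

import torch
from torch.nn import functional as F

vocab_size = 4
tokens = torch.tensor([[0, 1, 2, 3]])   # stands for [A, B, C, D], shape (B=1, 4)

inputs = tokens[:, :-1]                 # [A, B, C]
targets = tokens[:, 1:]                 # [B, C, D], shifted left by one

# Placeholder for model(inputs): one row of logits per input position.
logits = torch.zeros(1, inputs.size(1), vocab_size)   # (B, T, V)

# Flatten so every position counts as one classification example.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.reshape(-1))
print(round(loss.item(), 4), round(math.log(vocab_size), 4))  # 1.3863 1.3863
```

`reshape` is used for the targets because a slice such as `tokens[:, 1:]` is generally not contiguous in memory when the batch size is larger than one, in which case `view(-1)` would raise an error.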
19.4.2 Inference (Sequential)
During generation, we do not know the future.
1. Input: [A].
2. Step 1: Model outputs logits for the next token. We sample (e.g., using temperature or top-k) to get B.
3. Append: New input is [A, B].
4. Step 2: Model processes [A, B] to predict C.
5. Repeat: This continues until a special <EOS> (End of Sequence) token is generated or a length limit is reached.
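A minimal sketch of this loop is shown below. The `generate` function, its parameter names, and the `ToyModel` stand-in are all hypothetical; the only thing assumed about the real model is that calling it returns a `(logits, loss)` pair with logits of shape `(B, T, vocab_size)`, as in the `GPT.forward` method above. `ToyModel` always favors "last token + 1", which makes the output easy to verify by eye.

```python
import torch
from torch.nn import functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size,
             temperature=1.0, top_k=None, eos_id=None):
    """Append sampled tokens to idx (shape (B, T)) one step at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature    # keep only the last position
        if top_k is not None:
            k = min(top_k, logits.size(-1))
            kth = torch.topk(logits, k).values[:, [-1]]   # k-th largest logit
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)
        if eos_id is not None and (next_id == eos_id).all():
            break                                  # every sequence emitted <EOS>
    return idx


class ToyModel(torch.nn.Module):
    """Hypothetical stand-in: strongly favors (token + 1) mod vocab_size."""

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, idx, targets=None):
        nxt = (idx + 1) % self.vocab_size
        return F.one_hot(nxt, self.vocab_size).float() * 50.0, None


out = generate(ToyModel(10), torch.tensor([[0]]), max_new_tokens=4,
               block_size=8, top_k=1)
print(out.tolist())  # [[0, 1, 2, 3, 4]]
```

Each step here re-runs the model on the whole (cropped) prefix, which is the simplest correct approach; with `top_k=1` the sampling reduces to greedy decoding.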
19.5 5. Summary of Integration
The “GPT Style” model is a streamlined architecture optimized for text generation. By stacking Decoder blocks with Pre-Layer Normalization and Causal Masking, the model learns to compress the statistics of its training text into its weights, allowing it to act as a sophisticated next-token predictor. The integration of embeddings, attention heads, and MLPs creates a single differentiable path from token embeddings to next-token logits, so the whole stack can be trained end to end with one cross-entropy loss.